This document lists the steps we followed to refine the models.
Below we compare models using canopy symptoms as the response variable.
Let's add summaries of how many variables each of these datasets provides.
Note there may be a third soils dataset to incorporate. We also need to confirm that the normals data are in fact the latest climate normals.
The data used in the below models are described in the Data Wrangle folder.
There are multiple ways to group the response variable, depending on the desired resolution of the model.
For now, we move forward with the binary response grouping because it is the broadest and the easiest for the model to classify.
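The binary grouping described above can be sketched as follows. This is a minimal sketch, not the exact code used here; the data frame name `trees` and the collapsing of all non-"Healthy" categories into "Unhealthy" are assumptions based on the tallies shown below.

```r
# Sketch: collapse the 11 field canopy-symptom categories into a binary
# Healthy/Unhealthy response (`trees` is an assumed data frame name).
library(dplyr)

binary <- trees %>%
  mutate(
    field.tree.canopy.symptoms = factor(
      ifelse(field.tree.canopy.symptoms == "Healthy", "Healthy", "Unhealthy")
    )
  )
```

Collapsing to two classes also sidesteps the sparse categories (e.g., "Extra Cone Crop" with n = 2) that make multiclass error rates unstable.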
Filter trees to only those with soils data (Oregon and Washington)
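The soils filter could look like the sketch below; the state column name `place.state.name` is an assumption (it is a typical iNaturalist export field), not confirmed by this document.

```r
# Sketch: keep only trees in Oregon or Washington, i.e., the states
# covered by the soils datasets (`place.state.name` is an assumed column).
library(dplyr)

trees <- trees %>%
  filter(place.state.name %in% c("Oregon", "Washington"))
```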
All tree health categories
## # A tibble: 11 × 2
## # Groups: field.tree.canopy.symptoms [11]
## field.tree.canopy.symptoms n
## <fct> <int>
## 1 Branch Dieback or 'Flagging' 19
## 2 Browning Canopy 19
## 3 Extra Cone Crop 2
## 4 Healthy 403
## 5 Multiple Symptoms (please list in Notes) 17
## 6 New Dead Top (red or brown needles still attached) 33
## 7 Old Dead Top (needles already gone) 83
## 8 Other (please describe in Notes) 8
## 9 Thinning Canopy 118
## 10 Tree is dead 37
## 11 Yellowing Canopy 10
We also need to filter the data to only include response and explanatory variables we’re interested in. For example, whether a sound clip was included in the iNat data is not important.
We also need to remove the other response variables, such as “field.percent.canopy.affected….”, so they are not used as predictors of tree health.
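Dropping the bookkeeping and alternative-response columns can be sketched with `dplyr::select`; the column name `sound.url` is illustrative, and the `starts_with()` pattern is used so the truncated "field.percent.canopy.affected…." name does not need to be spelled out in full.

```r
# Sketch: drop iNat bookkeeping fields and alternative response variables
# so they cannot leak into the predictor set (names are illustrative).
library(dplyr)

binary <- binary %>%
  select(-sound.url, -starts_with("field.percent.canopy.affected"))
```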
Note it might be interesting to know whether the observer was an important factor in predicting whether a tree is healthy or unhealthy.
There are also a number of factors that should probably be removed because they may be biasing the data. For example, the ‘other factor’ question may only be answered for unhealthy trees. We need to think about this a bit more.
Remove variables with near-zero standard deviation (i.e., the entire column holds the same value).
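A base-R sketch of this step, assuming the working data frame is named `binary`; `caret::nearZeroVar()` is a common alternative when near-constant (not just constant) columns should also be dropped.

```r
# Sketch: drop columns whose non-missing values are all identical
# (zero variance), since they carry no information for the forest.
constant <- sapply(binary, function(x) length(unique(na.omit(x))) <= 1)
binary   <- binary[, !constant]
```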
We continue to get the error below, but were able to work around it by imputing the data.
Error in randomForest.default(m, y, …) : Need at least two classes to do classification.
To impute the data we have to remove factors with more than 53 levels (randomForest’s limit for categorical predictors).
The code below lists the number of levels for each factor variable.
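The level counts (and the removal of factors exceeding randomForest's 53-level limit) can be sketched as follows; the data frame name `binary` is assumed.

```r
# Sketch: count the levels of each factor column, then drop factors with
# more than 53 levels (randomForest cannot handle categorical predictors
# with more than 53 categories).
is.fac <- sapply(binary, is.factor)
nlevs  <- sapply(binary[is.fac], nlevels)
print(sort(nlevs, decreasing = TRUE))

binary <- binary[, !(names(binary) %in% names(nlevs[nlevs > 53]))]
```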
Imputed data table
## ntree       OOB      1      2       3      4      5      6      7       8      9      10      11
## 300:    46.06% 94.74% 94.74% 100.00% 16.87% 76.47% 87.88% 71.08% 100.00% 72.88% 91.89% 100.00%
## ntree       OOB      1      2       3      4      5      6      7       8      9      10      11
## 300:    45.79% 94.74% 94.74% 100.00% 15.63% 82.35% 87.88% 72.29% 100.00% 72.03% 97.30% 100.00%
## ntree       OOB      1       2       3      4      5      6      7       8      9      10      11
## 300:    45.39% 94.74% 100.00% 100.00% 14.39% 76.47% 87.88% 74.70% 100.00% 72.88% 94.59% 100.00%
## ntree       OOB      1      2       3      4      5      6      7       8      9      10      11
## 300:    46.73% 94.74% 94.74% 100.00% 16.38% 76.47% 87.88% 75.90% 100.00% 74.58% 94.59% 100.00%
## ntree       OOB       1      2       3      4      5      6      7       8      9      10      11
## 300:    45.93% 100.00% 94.74% 100.00% 15.14% 82.35% 87.88% 75.90% 100.00% 72.03% 94.59% 100.00%
## ntree       OOB      1      2       3      4      5      6      7       8      9      10      11
## 300:    46.06% 94.74% 94.74% 100.00% 16.13% 76.47% 87.88% 73.49% 100.00% 72.03% 97.30% 100.00%
Binary tree health categories
## # A tibble: 2 × 2
## # Groups: field.tree.canopy.symptoms [2]
## field.tree.canopy.symptoms n
## <fct> <int>
## 1 Healthy 403
## 2 Unhealthy 346
##
## Call:
## randomForest(formula = field.tree.canopy.symptoms ~ ., data = binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 23
##
## OOB estimate of error rate: 26.44%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 303 100 0.248139
## Unhealthy 98 248 0.283237
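As a sanity check on the output above, the OOB error rate is just the off-diagonal fraction of the confusion matrix: (100 + 98) / (403 + 346) = 198 / 749 ≈ 26.44%.

```r
# Verify the reported OOB error from the confusion matrix
# (rows = observed class, columns = predicted class).
conf <- matrix(c(303, 98, 100, 248), nrow = 2,
               dimnames = list(c("Healthy", "Unhealthy"),
                               c("Healthy", "Unhealthy")))
oob <- (conf["Healthy", "Unhealthy"] + conf["Unhealthy", "Healthy"]) / sum(conf)
round(100 * oob, 2)  # 26.44
```

The per-class errors check out the same way: 100/403 ≈ 0.2481 for Healthy and 98/346 ≈ 0.2832 for Unhealthy.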
##
## Call:
## randomForest(formula = field.tree.canopy.symptoms ~ ., data = monthless.binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 15
##
## OOB estimate of error rate: 28.7%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 293 110 0.2729529
## Unhealthy 105 241 0.3034682
##
## Call:
## randomForest(formula = field.tree.canopy.symptoms ~ ., data = normal.monthless.binary, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 13
##
## OOB estimate of error rate: 29.51%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 284 119 0.2952854
## Unhealthy 102 244 0.2947977